Data Analysis and Visualisation with R

1 Learning objectives

  • Get familiar with an IDE (Integrated Development Environment).
  • Know how to set up a data research project.
  • Visualise data: choose an appropriate visualisation for your data; understand ggplot2’s layered approach.
  • Be able to read in, process, and store data.
  • A few advanced topics: functions, conditional execution, joins, and loops.

In this doc, all code chunks are folded. Readers are encouraged to immediately look at the code for Sections 5 and 6. From Section 7 onwards, the reader can start to jot down some initial code before seeing the solution.

Let’s manage expectations:

  • I tailored this workshop for a two-hour session. Depending on the overall level, we might not cover all the content - that’s ok, you still get this document if you want to explore further.
  • I won’t be able to go into any detail about R itself, statistics, or visualisation - and that’s also ok.
  • The main objective here really is to give you a flavour of a few of the things that can be currently achieved with the kinds of software and hardware we have at our disposal.

I assume that you are now staring at an IDE - be it RStudio or VS Code - as per the email that was shared with you. Please shout if that is not the case!

2 Intro

For this workshop, we borrow heavily from Data Visualization for Social Science. This very markdown file is based on the R package accompanying the book. You can use it to take notes, write your code, and produce a good-looking, reproducible document that records the work you have done.

At the very top of the file is a section of metadata, or information about what the file is and what it does. The metadata is delimited by three dashes at the start and another three at the end. You can/should change the title, author, and date to the values that suit you. Keep the output line as it is for now, however. Each line in the metadata has a structure: first the key (title, author, etc.), then a colon, and then the value associated with the key. The format is very picky about indentation, the characters used, and so on.
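For instance, the metadata block might look something like this (the values here are placeholders, and the exact keys - including the output line - depend on how your file was generated):

```yaml
---
title: "Data Analysis and Visualisation with R"
author: "Your Name"
date: "2025-01-01"
format: html
---
```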

2.1 This is a Quarto File

Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents, and is also the basis for Quarto.

When you click the Render button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. A code chunk is a specially delimited section of the file. You can add one by moving the cursor to a blank line and choosing Code > Insert Chunk from the RStudio menu. When you do, an empty chunk will appear:

Code chunks are delimited by three backticks (found to the left of the 1 key on US and UK keyboards) at the start and end. The opening backticks also have a pair of braces and the letter r, to indicate what language the chunk is written in (this is a clue to the fact that you can have multiple programming languages in the document, such as Python, or Bash, or SQL). You write your code inside the code chunks. Write your notes and other material around them, as here.

3 Know your editor

Please have a look at your screen. Some of the most important things to learn now:

  • Console
  • Editor
  • Environment

For this workshop, we rely on RStudio. An alternative you may explore is VS Code.

4 Before you begin to explore and analyse, set up your workspace

A number of functions and packages come already installed in what is generally referred to as base R (you can actually see what’s in it by typing base:: in your console and letting the auto-complete suggest the full list of functions). However, most of the things we’ll do in this workshop need further libraries. To install them, make sure you have an Internet connection, then manually run the code in the chunk below. If you just render the document, this chunk is skipped - we set it up this way because you only need to install these packages once, not every time you run this file. Either run the chunk using the little green “play” arrow to the right of the chunk area, or copy and paste the text into the console window.
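The install chunk is presumably along these lines (a sketch - the exact package list is an assumption, based on the libraries used later in this document):

```r
# Run this once, manually - no need to repeat it on every render.
pkgs <- c(
    "tidyverse", # ggplot2, dplyr, tidyr, readr, ...
    "socviz",    # companion package to the book
    "gapminder", # the gapminder dataset
    "unvotes",   # UN General Assembly voting data
    "here",      # project-relative paths
    "tictoc",    # simple timing
    "data.table" # fast aggregation
)
install.packages(pkgs)
```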

4.1 Load Libraries

Once you have installed your libraries, you must load them each time you run this document. If we do not load them, R will not be able to find the functions contained in these libraries. The tidyverse includes ggplot2 (for data visualisation) and other libraries such as dplyr (for data manipulation). We also load the socviz and gapminder libraries.

Notice that here an option is set: include=FALSE. This tells R to run this code but not to include the output in the final document.
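The hidden loading chunk is presumably along these lines (a sketch - the exact set of libraries may differ):

```r
library(tidyverse) # loads ggplot2, dplyr, and friends
library(socviz)    # companion package to the book
library(gapminder) # the gapminder dataset
```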

You can embed an R code chunk like this:

Show the code
gapminder::gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

A final note about the here library. It is a great help if you have set up your project properly, as just indicated: all paths become relative, and you can stop worrying about where you are on your machine. The essential thing is that it relies on the .Rproj file that is created automatically for you when you create a Project with RStudio.
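A quick sketch of how it works (the data/my_file.csv path is hypothetical, just to illustrate):

```r
library(here)

here::here()                      # the project root, wherever you run this from
here::here("data", "my_file.csv") # an absolute path built relative to that root
```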

5 Data Visualisation

5.1 A few initial words

R has an established role in data visualisation, well beyond its initial remit in statistics.

5.2 Let’s start with some actual code

Let’s go back to the gapminder data we saw before. What kind of variables are we dealing with? Let’s remind ourselves.

Show the code
head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Technically, this data is in a long format, and you can immediately spot it by looking at the repeated values in the first two columns. An example of data in a wide format is provided by Eurostat.
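To see the difference, we can sketch a conversion of gapminder to one possible wide layout with tidyr, spreading the years into columns (keeping just one value column, lifeExp):

```r
library(gapminder)
library(tidyr)

gapminder_wide <- gapminder |>
    dplyr::select(country, year, lifeExp) |>
    pivot_wider(names_from = year, values_from = lifeExp)
# one row per country, one lifeExp column per year: no more repeated values
head(gapminder_wide, 2)
```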

We can explore the data in a more structured way by checking which classes the columns have, and by getting some descriptive statistics.

Show the code
str(gapminder) # classes and first values
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Show the code
glimpse(gapminder) # very similar, tidyverse way
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
Show the code
summary(gapminder) # main stats
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

Notice how we use functions in R: we call a function - in the chunk above, str, glimpse, summary - and then pass an argument between parentheses - in this case, the entire dataset.
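Arguments can be passed by position or by name; the two calls below are equivalent (a tiny sketch):

```r
library(gapminder)

head(gapminder, 3)         # second argument by position
head(x = gapminder, n = 3) # same call, with both arguments named
```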

Let’s now start to analyse the data. Say that we have a hypothesis regarding how life expectancy (lifeExp) changes in relation to GDP per capita (gdpPercap). More precisely, we may think that as GDP per capita increases, so should life expectancy. Graphically, if we look back at our data, we should see dots amassing in the lower-left and upper-right sections of our graph. As these look like continuous variables, we could plot them in a scatterplot. There are at least two ways to create a plot with ggplot. As you’ll see, there are usually multiple ways to achieve the exact same output with these programming languages.

  • We can assign a ggplot plot to an object called p (the name doesn’t matter, we can call it mike if we want to), and then add (literally, using +) layers on top of it.
Show the code
p <- ggplot(
    data = gapminder,
    mapping = aes(
        x = gdpPercap,
        y = lifeExp
    )
)
p + geom_point() # and you can keep on adding to this, see below ...

  • Another way of achieving the same thing is as follows, which is more in line with the tidyverse style, which relies heavily on pipes (|>).
Show the code
gapminder |> # pipe
    ggplot(aes(x = gdpPercap, y = lifeExp)) + # layered ggplot approach
    geom_point()

Let’s unpack what happens in the code above:

  • we grab the data by calling gapminder
  • on this data, we apply a function, namely ggplot
  • calling ggplot on that data gives us exactly the same output as the p object above, so we can add a layer on top of that ggplot object, namely geom_point.

Let’s go back to the graph now. Sometimes, when we face a cloud like the one above, it is useful to plot a trendline on top, to understand the overall pattern.

Show the code
# geom_smooth() using method = 'gam' and formula 'y ~ s(x, bs = "cs")' - GAM stands for generalised additive model
p + geom_point() +
    geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Perhaps that is too wiggly. Using method = "lm" (linear model) as an argument to geom_smooth() fits a simple OLS regression instead. This should also make you aware of the danger of extrapolating from predictions of inappropriate models.

Show the code
# another way to get the same result
ggplot(
    data = gapminder,
    mapping = aes(
        x = gdpPercap,
        y = lifeExp
    )
) +
    geom_point() +
    geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

The data is quite bunched up against the left side. Gross Domestic Product per capita is not normally distributed across our country-years. The x-axis scale would probably look better if it were transformed from a linear scale to a log scale. For this we can use a function called scale_x_log10(). As you might expect, this function scales the x-axis of a plot to a log 10 basis. To use it we just add it to the plot:

Show the code
p <- ggplot(
    data = gapminder,
    mapping = aes(
        x = gdpPercap,
        y = lifeExp
    )
)
p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_log10()
`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

The labels on the tick-marks can be controlled through the scale_* functions. Notice that we are using another library, scales, to format the tick labels as dollars.

Show the code
p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_log10(labels = scales::dollar)
`geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

The plot above is already much more informative than the ones before, but there is more we can extract from the underlying data. For instance, it may now make sense to re-introduce the lm method we used before.

In passing, also notice the use of different parameters in the plot. The se option controls the standard-error ribbon, and we can switch it off by setting se = FALSE. The alpha parameter regulates the transparency of the objects, from 0 to 1. Further, we can add information and make the plot more publication-ready.

Show the code
ggplot(
    data = gapminder,
    mapping = aes(
        x = gdpPercap,
        y = lifeExp
    )
) +
    geom_point(alpha = 0.3) +
    geom_smooth(color = "orange", se = FALSE, linewidth = 1, method = "lm") +
    scale_x_log10(labels = scales::dollar) +
    labs(
        x = "GDP per capita (in log scale)",
        y = "Life Expectancy in Years",
        title = "Economic Growth and Life Expectancy",
        subtitle = "Data points are country-years",
        caption = "Source: Gapminder."
    ) +
    theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Another way to think about the underlying data is to remind ourselves of the groups we have, represented by the continent variable. In the code below, we include colour as an aesthetic and map it onto continent. Notice that, by also including a colour parameter in the geom_smooth function, we are overriding the colour mapping in the ggplot function. This has important consequences, and is a further example of the sequential execution typical of many programming languages (left-to-right, top-to-bottom).

Show the code
ggplot(
    data = gapminder,
    mapping = aes(
        x = gdpPercap,
        y = lifeExp,
        color = continent
    )
) +
    geom_point(alpha = 0.3) +
    geom_smooth(color = "orange", se = FALSE, linewidth = 1, method = "lm") +
    scale_x_log10(labels = scales::dollar)
`geom_smooth()` using formula = 'y ~ x'

Indeed, see what happens when we do not override it. We can immediately appreciate that the slope of our regression is radically different depending on the groups - consider this your introduction to the dangers of pooling.

Show the code
ggplot(
    data = gapminder,
    mapping = aes(
        x = gdpPercap,
        y = lifeExp,
        color = continent
    )
) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm") +
    scale_x_log10(labels = scales::dollar)
`geom_smooth()` using formula = 'y ~ x'

However, this has become too cluttered, and it’s hard to understand what’s going on with all these colours and lines. We reached saturation and need to revert to a simpler plot.

Show the code
ggplot(
    data = gapminder,
    mapping = aes(
        x = gdpPercap,
        y = lifeExp
    )
) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm", colour = "orange", fill = "grey90") +
    facet_wrap(~continent) +
    scale_x_log10(labels = scales::dollar) +
    theme_minimal() +
    labs(
        x = "GDP per capita",
        y = "Life Expectancy in Years",
        title = "Economic Growth and Life Expectancy",
        subtitle = "Data points are country-years",
        caption = "Source: Gapminder."
    )
`geom_smooth()` using formula = 'y ~ x'

What you observe above is variously termed as a small multiple visualisation, or faceted visualisation. The key point is that we break down the data in meaningful groups, and analyse them separately. This enables us to see more clearly similarities and differences. Remember, all analyses are comparisons.

A radically different take is to exploit the time dimension.

Show the code
gapminder |>
    # subset the data to just the rows with these continents
    filter(continent %in% c("Asia", "Africa", "Europe", "Americas")) |>
    ggplot(aes(x = year, y = gdpPercap)) +
    geom_line(color = "gray80", aes(group = country)) +
    geom_smooth(linewidth = 1, method = "loess", se = FALSE) +
    scale_y_log10(labels = scales::dollar) +
    facet_wrap(~continent, ncol = 2) +
    theme_minimal() +
    labs(
        x = "Year",
        y = "GDP per capita",
        title = "GDP per capita on Five Continents"
    )
`geom_smooth()` using formula = 'y ~ x'

A few takeaways from this.

  • First, our visualisation obviously brings the reader’s attention to the blue trend line by continent. For instance, the value at the end of the trend line for Asia is roughly where the trend line for Europe starts.
  • Second, pay attention to the larger variation surrounding the trend line in continents such as Asia and Africa, compared to the relatively tighter distribution in Europe. Inter alia, this would have substantial consequences were we to fit a model to such data: the larger variance in the former continents would mean larger uncertainty in our estimates compared to the European case. In a setting you may be more familiar with, the same happens with party polls, where substantial variation in the polls translates into larger margins of error in projections.
Show the code
rm(p)

6 UN Votes

In this section, too, the data comes directly from a library, exactly as in the case of gapminder. How many tables are we dealing with?

Show the code
library(unvotes)
If you use data from the unvotes package, please cite the following:

Erik Voeten "Data and Analyses of Voting in the UN General Assembly" Routledge Handbook of International Organization, edited by Bob Reinalda (published May 27, 2013)
Show the code
un_votes <- unvotes::un_votes
un_roll_calls <- unvotes::un_roll_calls

6.1 Counting categorical variables

If we want to get a quick glimpse of the distribution of votes, we can simply count how many there are per unique value of the vote variable.

Show the code
un_votes |>
    count(vote, sort = TRUE)
# A tibble: 3 × 2
  vote         n
  <fct>    <int>
1 yes     693544
2 abstain 110893
3 no       65500
Show the code
with(
    un_votes,
    tapply(X = vote, INDEX = vote, FUN = length)
)
    yes abstain      no 
 693544  110893   65500 
Show the code
data.table::setDT(un_votes)[, .N, by = list(vote)]
      vote      N
    <fctr>  <int>
1:     yes 693544
2:      no  65500
3: abstain 110893

Say that we don’t particularly like this representation of the data, and we would like to recode the vote variable. We could achieve that like so.

Show the code
# recode
un_votes$vote_int[un_votes$vote == "yes"] <- 1L
un_votes$vote_int[un_votes$vote == "no"] <- -1L
un_votes$vote_int[un_votes$vote == "abstain"] <- 0L
# check
table(un_votes$vote_int, un_votes$vote, exclude = NULL)
    
        yes abstain     no
  -1      0       0  65500
  0       0  110893      0
  1  693544       0      0

6.2 Aggregations

Share of votes, by country. Check how you can do it in two ways, with a rough appreciation of the timing difference. I’ll walk you through just the tidyverse implementation.

  • we call the data
  • we group by a variable of interest
  • we squash all the data by the grouping variable, and calculate two new variables, namely the total number of votes and the number of yes votes.
  • we close the grouped calculations by calling off the group_by - that is, ungroup()
  • we create a new variable, which is just the share
Show the code
library(tictoc) # library for timing

# aggregation: tidyverse style ------------------------------------------------#
tictoc::tic() # timing on

un_votes |>
    group_by(country) |> # group by country
    summarise(
        n_votes = n(), # count votes
        n_yes = sum(vote == "yes", na.rm = TRUE) # count yes
    ) |>
    ungroup() |> # ungroup operations
    mutate(pct_yes = n_yes / n_votes) # create new col, no summarise
# A tibble: 200 × 4
   country           n_votes n_yes pct_yes
   <chr>               <int> <int>   <dbl>
 1 Afghanistan          5604  4781   0.853
 2 Albania              4237  3007   0.710
 3 Algeria              5289  4666   0.882
 4 Andorra              2323  1539   0.663
 5 Angola               3739  3436   0.919
 6 Antigua & Barbuda    3344  3064   0.916
 7 Argentina            6132  4851   0.791
 8 Armenia              2361  1790   0.758
 9 Australia            6166  3465   0.562
10 Austria              5709  3683   0.645
# ℹ 190 more rows
Show the code
tictoc::toc() # timing off
0.032 sec elapsed
Show the code
# aggregation: data.table style -----------------------------------------------#
tictoc::tic()

data.table::setDT(un_votes)[
    , list(
        n_votes = .N,
        n_yes = sum(vote == "yes", na.rm = TRUE)
    ),
    keyby = list(country)
][
    , pct_yes := n_yes / n_votes
][]
Key: <country>
                     country n_votes n_yes   pct_yes
                      <char>   <int> <int>     <num>
  1:             Afghanistan    5604  4781 0.8531406
  2:                 Albania    4237  3007 0.7097003
  3:                 Algeria    5289  4666 0.8822084
  4:                 Andorra    2323  1539 0.6625054
  5:                  Angola    3739  3436 0.9189623
 ---                                                
196: Yemen People's Republic    2378  2137 0.8986543
197:              Yugoslavia    5474  4345 0.7937523
198:                  Zambia    5079  4679 0.9212443
199:                Zanzibar       2     0 0.0000000
200:                Zimbabwe    3546  3247 0.9156796
Show the code
tictoc::toc()
0.017 sec elapsed

6.3 Functions

As a more advanced topic, if you know you’ll execute the same code over and over, you can package it in a function. Notice a few things:

  • name/position of arguments
  • default values
  • calling functions explicitly from namespace
Show the code
# write your first function
summarise_votes <- function(data_in = un_votes, min_tot_votes = 10L) {
    data_in |>
        dplyr::summarise(
            n_votes = n(),
            n_yes = sum(vote == "yes", na.rm = TRUE)
        ) |>
        dplyr::filter(n_votes >= min_tot_votes) |> # filter by at least this many votes
        dplyr::mutate(pct_yes = n_yes / n_votes) |>
        dplyr::arrange(dplyr::desc(pct_yes)) # arrange by pct_yes
}

# test, we should get the same answer as above ... -------------------------------#
un_votes |>
    group_by(country) |>
    summarise_votes() |> 
    ungroup()
# A tibble: 199 × 4
   country             n_votes n_yes pct_yes
   <chr>                 <int> <int>   <dbl>
 1 Seychelles             2109  2060   0.977
 2 São Tomé & Príncipe    2686  2584   0.962
 3 Cape Verde             3941  3762   0.955
 4 Timor-Leste            1387  1320   0.952
 5 Guinea-Bissau          3595  3419   0.951
 6 Djibouti               4073  3829   0.940
 7 Mozambique             4152  3894   0.938
 8 Suriname               4049  3786   0.935
 9 Maldives               4644  4330   0.932
10 Belize                 3115  2904   0.932
# ℹ 189 more rows

7 EP Data

7.1 How many documents in each Committee?

Say that you want to organise the Group so that the Committees that receive the most files are also allocated more manpower. How would you go about answering this question?

  • Find the data.
  • Load it in your machine.
  • Process it.
  • Spit back the result.
Show the code
#------------------------------------------------------------------------------#
# Download and Process EP Open Data on Plenary Docs ----------------------------
#------------------------------------------------------------------------------#

#' Collects .csv files from the [EP Open Data Portal](https://data.europarl.europa.eu/en/datasets?language=en&order=RELEVANCE&dataThemeFamily=dataset.theme.EP_PLEN_DOC).
#' The code then proceeds to tidy such data and write to disk.

## Texts tabled: read csv & append it all together -----------------------------
docs_csv <- c(
    "https://data.europarl.europa.eu/distribution/plenary-documents_2024_15_en.csv",
    "https://data.europarl.europa.eu/distribution/plenary-documents_2023_52_en.csv",
    "https://data.europarl.europa.eu/distribution/plenary-documents_2022_33_en.csv",
    "https://data.europarl.europa.eu/distribution/plenary-documents_2021_22_en.csv",
    "https://data.europarl.europa.eu/distribution/plenary-documents_2020_14_en.csv",
    "https://data.europarl.europa.eu/distribution/plenary-documents_2019_5_en.csv"
)
# read all .csv at once
docs_list <- lapply(X = docs_csv, readr::read_csv)
# append all .csv together ----------------------------------------------------#
plenary_docs <- data.table::rbindlist(l = docs_list, use.names = TRUE, fill = TRUE)

#------------------------------------------------------------------------------#
# Same thing, as a loop -------------------------------------------------------#
# docs_list <- NULL
# for (i_doc in docs_csv) {
#   print(i_doc)
#   docs_list[[i_doc]] <- read.csv(file = i_doc) }
#------------------------------------------------------------------------------#

rm(docs_list, docs_csv)
gc()
          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 2661024 142.2    4140331 221.2  4140331 221.2
Vcells 8494063  64.9   25760601 196.6 25756967 196.6

Extract the year feature, similar to what we did above, so that we can group the data afterwards.

Show the code
# extract year
plenary_docs$year <- lubridate::year(plenary_docs$document_date)
# unique(plenary_docs$year) # check

Do we have duplicate entries in the doc id? We can table the number of times each document appears in the file.

Show the code
table(plenary_docs$document_identifier, exclude = NULL) |>
    head(5)

A-8-2019-0001 A-8-2019-0002 A-8-2019-0003 A-8-2019-0004 A-8-2019-0005 
            1             1             1             1             1 

We can then take the mean of the cross-tabulation above.

Show the code
mean(table(plenary_docs$document_identifier, exclude = NULL))
[1] 1

We can then extract the unique names of the Committees in the text.

Show the code
head(sort(unique(plenary_docs$document_creator_organization)))
[1] "Bureau of the European Parliament"                                                                          
[2] "Committee of Inquiry on the Protection of Animals during Transport"                                         
[3] "Committee on Agriculture and Rural Development"                                                             
[4] "Committee on Agriculture and Rural Development; Committee on the Environment, Public Health and Food Safety"
[5] "Committee on Budgetary Control"                                                                             
[6] "Committee on Budgetary Control; Committee on Budgets"                                                       
Show the code
tail(sort(unique(plenary_docs$document_creator_organization)))
[1] "Special committee on financial crimes, tax evasion and tax avoidance"                                                                                                                                                                                                                                                  
[2] "Special Committee on Foreign Interference in all Democratic Processes in the European Union, including Disinformation"                                                                                                                                                                                                 
[3] "The Left group in the European Parliament - GUE/NGL"                                                                                                                                                                                                                                                                   
[4] "The Left group in the European Parliament - GUE/NGL; Group of the Progressive Alliance of Socialists and Democrats in the European Parliament; Group of the Greens/European Free Alliance; European Conservatives and Reformists Group; Renew Europe Group; Group of the European People's Party (Christian Democrats)"
[5] "The Left group in the European Parliament - GUE/NGL; Group of the Progressive Alliance of Socialists and Democrats in the European Parliament; Group of the Greens/European Free Alliance; Renew Europe Group"                                                                                                         
[6] "The Left group in the European Parliament - GUE/NGL; Group of the Progressive Alliance of Socialists and Democrats in the European Parliament; Group of the Greens/European Free Alliance; Renew Europe Group; Group of the European People's Party (Christian Democrats)"                                             

There are way too many values here. We need to filter on the string to get just the pattern of interest, namely Committee on.

Show the code
plenary_docs |>
    dplyr::filter(
        grepl(
            pattern = "Committee on ",
            x = document_creator_organization
        )
    ) |>
    dplyr::group_by(document_creator_organization) |>
    dplyr::summarise(count = n()) |>
    dplyr::arrange(dplyr::desc(count))
# A tibble: 64 × 2
   document_creator_organization                               count
   <chr>                                                       <int>
 1 Committee on Budgetary Control                                381
 2 Committee on Economic and Monetary Affairs                    180
 3 Committee on the Environment, Public Health and Food Safety   175
 4 Committee on Civil Liberties, Justice and Home Affairs        154
 5 Committee on Foreign Affairs                                  124
 6 Committee on Legal Affairs                                    124
 7 Committee on Budgets                                          103
 8 Committee on International Trade                               91
 9 Committee on Transport and Tourism                             79
10 Committee on Industry, Research and Energy                     72
# ℹ 54 more rows

However, we’re not done here. If we actually check the tail of the table above - by simply running tail() - we notice something odd.

Show the code
plenary_docs |>
    dplyr::filter(
        grepl(
            pattern = "Committee on ",
            x = document_creator_organization
        )
    ) |>
    tidyr::separate_longer_delim(cols = document_creator_organization, delim = ";") |>
    dplyr::mutate(
        document_creator_organization = trimws(document_creator_organization)
    ) |>
    dplyr::select(document_identifier, document_creator_organization) |>
    dplyr::group_by(document_identifier) |>
    dplyr::summarise(committee_count = n()) |>
    dplyr::ungroup() |>
    dplyr::arrange(dplyr::desc(committee_count))
# A tibble: 1,975 × 2
   document_identifier committee_count
   <chr>                         <int>
 1 A-9-2021-0255                     3
 2 A-9-2022-0248                     3
 3 A-8-2019-0068                     2
 4 A-8-2019-0087                     2
 5 A-8-2019-0160                     2
 6 A-8-2019-0173                     2
 7 A-8-2019-0175                     2
 8 A-9-2020-0107                     2
 9 A-9-2020-0117                     2
10 A-9-2020-0173                     2
# ℹ 1,965 more rows

We need to split the cells that contain multiple Committees.

Show the code
comm_doc_count <- plenary_docs |>
    dplyr::filter(
        grepl(
            pattern = "Committee on ",
            x = document_creator_organization
        )
    ) |>
    tidyr::separate_longer_delim(cols = document_creator_organization, delim = ";") |>
    dplyr::mutate(
        document_creator_organization = trimws(document_creator_organization)
    ) |>
    dplyr::group_by(document_creator_organization) |>
    dplyr::summarise(doc_count = n()) |>
    dplyr::ungroup() |>
    dplyr::arrange(dplyr::desc(doc_count))
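
Before moving on, it may help to see tidyr::separate_longer_delim() in isolation. Here is a minimal, made-up sketch (the identifiers and committee pairing below are invented for illustration): one semicolon-delimited cell becomes several rows, and trimws() then strips the stray spaces left around each piece.

Show the code
toy <- dplyr::tibble(
    document_identifier = c("A-0-0000-0001", "A-0-0000-0002"),
    document_creator_organization = c(
        "Committee on Budgets",
        "Committee on Budgets; Committee on Legal Affairs"
    )
)
toy |>
    tidyr::separate_longer_delim(cols = document_creator_organization, delim = ";") |>
    dplyr::mutate(
        document_creator_organization = trimws(document_creator_organization)
    )
# The second document now spans two rows, one per Committee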

And we can plot to visually inspect that everything went as intended.

Show the code
comm_doc_count |>
    dplyr::filter(doc_count > 1L) |>
    dplyr::mutate(document_creator_organization = forcats::fct_reorder(
        .f = document_creator_organization, .x = doc_count, .fun = max
    )) |>
    ggplot(aes(x = document_creator_organization, y = doc_count)) +
    geom_col(colour = "black", fill = "grey", linewidth = 0.1) +
    coord_flip() +
    labs(x = "", y = "Count") +
    theme_minimal()
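
A note on the forcats::fct_reorder() call above: without it, ggplot2 orders the bars alphabetically by Committee name. A small illustration with made-up counts:

Show the code
x <- c("Budgets", "Legal Affairs", "Foreign Affairs")
n <- c(103L, 124L, 124L)
levels(forcats::fct_reorder(.f = x, .x = n, .fun = max))
# Levels now run from the smallest to the largest count, so after coord_flip()
# the Committee with the most documents sits at the top of the chart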

Now, try to come up with your own solution: what if you also wanted to know the volumes per Committee per year?

Show the code
plenary_docs |>
    dplyr::filter(grepl(
        pattern = "Committee on ",
        x = document_creator_organization
    )) |>
    tidyr::separate_longer_delim(cols = document_creator_organization, delim = ";") |>
    dplyr::mutate(
        document_creator_organization = trimws(
            gsub(
                pattern = "Committee on the |Committee on ", replacement = "",
                x = document_creator_organization
            )
        )
    ) |>
    dplyr::group_by(document_creator_organization, year) |>
    dplyr::summarise(doc_count = n()) |>
    dplyr::ungroup() |>
    dplyr::arrange(dplyr::desc(doc_count)) |>
    dplyr::filter(doc_count > 1L) |>
    ggplot(aes(x = year, y = doc_count)) +
    geom_line(colour = "black", linewidth = 1) +
    facet_wrap(~document_creator_organization,
        ncol = 4,
        labeller = labeller(document_creator_organization = label_wrap_gen(20))
    ) +
    labs(x = "", y = "Count") +
    theme_minimal()

8 Polls (EE, de, means, aggregations)

This section is the most advanced one. The polling data are extracted from EuropeElects.

Show the code
### --------------------------------------------------------------------------###
## EuropeElects ----------------------------------------------------------------
# Read in data (REF: https://storage.googleapis.com/asapop-website-20220812/csv.html; https://filipvanlaenen.github.io/eopaod/pl.csv)
ee <- data.table::fread(
    # file = here::here("data_in", "pl.csv")
    input = "https://storage.googleapis.com/asapop-website-20220812/_csv/de.csv"
) |>
    janitor::clean_names()
# str(ee)
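### --------------------------------------------------------------------------###
## Aside: what clean_names() does ----------------------------------------------
# (Illustrative only; the toy data frame below is made up, not from the data.)
# janitor::clean_names() rewrites column names into snake_case, which is why we
# can refer to columns such as fieldwork_start further down.
janitor::clean_names(
    data.frame("Fieldwork Start" = 1, "CDU/CSU" = 2, check.names = FALSE)
)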

### --------------------------------------------------------------------------###
## Process the data ------------------------------------------------------------
# Reshape to long
ee_long <- ee |>
    tidyr::pivot_longer(
        cols = cdu_csu:other,
        # cols = c(which(names(ee) == "cdu_csu") : which(names(ee) == "other") ),
        names_to = "party_ee", values_to = "share"
    ) |>
    # recode date and share
    dplyr::mutate(
        fieldwork_start = as.Date(fieldwork_start),
        share = readr::parse_number(share)
    )
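
### --------------------------------------------------------------------------###
## Aside: what parse_number() does ---------------------------------------------
# (Illustrative only; the example strings below are made up.)
# readr::parse_number() drops non-numeric characters, so a share stored as text
# such as "25%" becomes the number 25; strings containing no number at all
# become NA, with a parsing warning.
readr::parse_number(c("25%", "8.5%", "Not Available"))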

### --------------------------------------------------------------------------###
## Plot ------------------------------------------------------------------------
ee_long |>
    dplyr::filter(
        !is.na(party_ee) & !is.na(share) &
            party_ee %in% c(
                "af_d", "cdu", "cdu_csu", "cdu", "fdp", "fw", "grune",
                "link", "spd"
            ) &
            fieldwork_start >= as.Date("2024-01-01")
    ) |>
    ggplot(aes(x = fieldwork_start, y = share)) +
    geom_point(alpha = 0.2) +
    geom_smooth(aes(group = party_ee, colour = party_ee, fill = party_ee),
        method = "loess", alpha = 0.2
    ) +
    labs(
        colour = "", fill = "", x = "",
        title = "Parties' trends in the polls",
        caption = paste0("Source: EuropeElects. Data extracted on ", Sys.Date())
    ) +
    theme_minimal() +
    theme(
        plot.title.position = "plot",
        plot.title = element_text(face = "bold"),
        legend.position = "top"
    )